Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GitHub Action for Nextclade annotations #158

Merged
merged 7 commits into from
Apr 1, 2024

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Mar 25, 2024

Description of proposed changes

Adds rules, config, and GitHub Action file to support running Nextclade on all available HA and NA sequences for H1N1pdm, H3N2, and Vic.

The new Snakemake logic lives in a custom build config named nextclade which skips most of the standard workflow build logic, using only the "download from S3" rules and its own custom rules to run Nextclade for the lineages and segments defined in the build config. This custom logic runs Nextclade with the default dataset per lineage and segment just like the flu_frequencies workflow and uploads the results to S3. Once we have merged this PR, we should be able to automatically run Nextclade with each ingest of new data and run flu_frequencies from the resulting files on S3.

This PR includes a couple of minor changes to other parts of the standard workflow to allow the Nextclade build config YAML to be as simple as possible and also to allow all workflows to download the parsed sequences and metadata from S3 instead of downloading the raw sequences and parsing them again locally.

To run the Nextclade workflow, use the following command:

nextstrain build . upload_all_nextclade_files --configfile profiles/nextclade.yaml

This uploads Nextclade annotations and alignment files to S3 per lineage and segment like seasonal-flu/vic/ha/nextclade.tsv.xz and seasonal-flu/vic/ha/aligned.fasta.xz.

This PR does not produce a merged metadata and Nextclade annotations file like the one used in the flu_frequencies workflow or the analogous ncov metadata files with Nextclade annotations included. I stopped short of creating this merged file in S3, too, because we use a single metadata TSV per lineage (for all segments) but we produce Nextclade annotations per segment. We could upload the metadata TSVs per lineage and segment with Nextclade annotations merged into a single file, but this would duplicate a lot of information across segments. Maybe that duplication is acceptable, but it's worth discussing more internally first.

Note: since this PR adds a new GitHub Action, I can't test the action until we've merged the PR into master.

Related issue(s)

Related to #144

Checklist

  • Checks pass
  • GitHub Action runs and deploys data to S3 as expected

huddlej added 4 commits March 25, 2024 16:56
Adds rules, config, and GitHub Action file to support running Nextclade
on all available sequences. Not yet tested.
Remove unnecessary configuration parameters from the Nextclade build
config and update the workflow to allow these parameters to be missing.
Since Snakemake evaluates the Python code in each rule's inputs,
outputs, and params, rules that we don't plan to run in the workflow can
produce key errors when their config parameters are not defined.
Simplifies the logic to get Nextclade datasets by following the same
pattern as the flu_frequencies workflow [1] where we grab the default
dataset for a given lineage and segment instead of specifying a
reference name. The "broad" and more recent references for H3N2 HA, for
example, are not too different from each other, but the Nextclade
annotations for the former are far more verbose than for the latter. We
also want the files produced by this workflow to plug directly into the
flu_frequencies workflow logic, so it is best to use the same approach
here.

[1] https://github.com/nextstrain/flu_frequencies/blob/6e4298fac3361f4a6751d85bcb963064dbb9eee1/Snakefile#L95
Adds NA to list of segments, since we want to know the subclade
annotations for NA as well as HA and use these data to estimate
frequencies.
@huddlej huddlej marked this pull request as ready for review March 26, 2024 16:46
@huddlej huddlej requested a review from rneher March 26, 2024 16:54
@joverlee521
Copy link
Contributor

Not doing an in-depth review, just reading over this PR because I'm interested in how you handled the different segments.

I stopped short of creating this merged file in S3, too, because we use a single metadata TSV per lineage (for all segments) but we produce Nextclade annotations per segment. We could upload the metadata TSVs per lineage and segment with Nextclade annotations merged into a single file, but this would duplicate a lot of information across segments. Maybe that duplication is acceptable, but it's worth discussing more internally first.

It'd be good to discuss what's the easiest for downstream users here (granted that may currently only be the flu-frequencies workflow?).

Note: since this PR adds a new GitHub Action, I can't test the action until we've merged the PR into master.

It's possible to test a new GitHub Action with either Tom's or Victor's work-around

Add temporary trigger for Nextclade workflow on PR event. This should
trigger the workflow when I push the update to the PR. If it works, I
should drop this commit again.
@huddlej
Copy link
Contributor Author

huddlej commented Mar 26, 2024

It'd be good to discuss what's the easiest for downstream users here

Totally agree, @joverlee521! @plsteinberg is one such downstream user who has to manually join the metadata and HA Nextclade annotations for their project, so we might have an idea of how that could be improved based on their feedback. 😄

It's possible to test a new GitHub Action with either Tom's or Victor's work-around

Gah, I knew this existed but forgot the mechanics. Thank you for the reminder! Trying with Victor's approach now.

huddlej added 2 commits March 26, 2024 14:51
The workflow ran successfully, so removing this trigger.
Reduce threads requested for Nextclade runs from 16 to 12 so we can run 3 Nextclade jobs at once (one per lineage) on a 36-core instance of AWS Batch.
@huddlej huddlej changed the title Prototype GitHub Action for Nextclade annotations Add GitHub Action for Nextclade annotations Mar 30, 2024
@huddlej
Copy link
Contributor Author

huddlej commented Apr 1, 2024

I'm going to merge this now, so I can start running the GitHub Action after our weekly ingests. We should continue to discuss this implementation in the future, though, since there is likely still remove to improve the user experience.

@huddlej huddlej merged commit a162342 into master Apr 1, 2024
3 checks passed
@huddlej huddlej deleted the add-nextclade-github-action branch April 1, 2024 17:03
@joverlee521 joverlee521 mentioned this pull request May 13, 2024
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants